K-MEANS CLUSTERING



INTRODUCTION
What is clustering?


Clustering is the classification of objects into
different groups, or more precisely, the
partitioning of a data set into subsets
(clusters), so that the data in each subset
(ideally) share some common trait, often
proximity according to some defined distance measure.







Types of clustering:

1. Hierarchical algorithms: these find successive clusters
using previously established clusters.

   1. Agglomerative ("bottom-up"): agglomerative algorithms begin
   with each element as a separate cluster and merge them into
   successively larger clusters.

   2. Divisive ("top-down"): divisive algorithms begin with the
   whole set and proceed to divide it into successively smaller
   clusters.

2. Partitional clustering: partitional algorithms determine all clusters at
once. They include:

   - K-means and derivatives

   - Fuzzy c-means clustering
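
The agglomerative ("bottom-up") strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the distance between two clusters is measured, so the sketch assumes single linkage (distance between the closest pair of members) on one-dimensional values, and the function name `agglomerate` is hypothetical.

```python
def agglomerate(values, target):
    """Bottom-up clustering of 1-D values: start with one cluster per
    value and repeatedly merge the two closest clusters until only
    `target` clusters remain. Single linkage is an assumption here."""
    clusters = [[v] for v in values]
    while len(clusters) > target:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return clusters


print(agglomerate([1.0, 1.5, 5.0, 5.2, 9.0], 3))
# [[1.0, 1.5], [5.0, 5.2], [9.0]]
```

A divisive algorithm would run in the opposite direction, starting from one cluster holding every value and splitting it.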







Common Distance measures: 


• The distance measure determines how the similarity of two
elements is calculated, and it influences the shape of the
clusters.

They include: 

1. The Euclidean distance (also called 2-norm distance) is given by:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

2. The Manhattan distance (also called taxicab norm or 1-norm) is
given by:

$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$







3. The maximum norm is given by:

$d(x, y) = \max_{1 \le i \le n} |x_i - y_i|$

4. The Mahalanobis distance corrects data for 
different scales and correlations in the variables. 

5. Inner product space: the angle between two
vectors can be used as a distance measure when
clustering high-dimensional data.

6. Hamming distance (an edit distance restricted to
substitutions) measures the minimum number of substitutions
required to change one member into another of equal length.
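
As a rough illustration of the measures listed above, the following Python sketch implements the Euclidean, Manhattan, maximum-norm, angle, and Hamming distances for plain sequences. The function names are chosen for this example only, and the Mahalanobis distance is left out because it needs a covariance estimate of the data.

```python
import math


def euclidean(x, y):
    """2-norm distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """1-norm (taxicab) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum_norm(x, y):
    """Largest absolute difference over all coordinates."""
    return max(abs(a - b) for a, b in zip(x, y))


def angle(x, y):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against rounding just outside the domain.
    return math.acos(max(-1.0, min(1.0, dot / (norm_x * norm_y))))


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


x, y = (1.0, 1.0), (4.0, 5.0)
print(euclidean(x, y))                # 5.0
print(manhattan(x, y))                # 7.0
print(maximum_norm(x, y))             # 4.0
print(hamming("karolin", "kathrin"))  # 3
```

The same pair of points gives three different values, which is why the choice of measure changes which points end up grouped together.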






K-MEANS CLUSTERING 


The k-means algorithm is an algorithm to cluster n
objects based on attributes into k partitions, where
k < n.


It assumes that the object attributes form a vector
space.








An algorithm for partitioning (or clustering) N
data points into K disjoint subsets S_j
containing N_j data points so as to minimize the
sum-of-squares criterion

$J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$

where x_n is a vector representing the n-th
data point and \mu_j is the geometric centroid of
the data points in S_j.
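
The criterion described above, the sum of squared distances from each data point to the centroid of its own subset, can be evaluated directly. The sketch below is a minimal illustration: the helper names `centroid` and `sum_of_squares` and the four sample points are made up for this example.

```python
def centroid(points):
    """Coordinate-wise mean (geometric centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sum_of_squares(clusters):
    """Sum over all clusters of the squared Euclidean distance between
    each point and the centroid of the cluster it belongs to."""
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, mu))
    return total


# Hypothetical data: two subsets of two points each.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],  # centroid (1.0, 0.0)
    [(5.0, 5.0), (5.0, 7.0)],  # centroid (5.0, 6.0)
]
print(sum_of_squares(clusters))  # 4.0
```

Every point here lies at distance 1 from its centroid, so the four squared distances add up to 4. A worse partition of the same points would give a larger value.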





Simply speaking, k-means clustering is an
algorithm to classify or group objects
based on their attributes/features into K
groups.

K is a positive integer.

The grouping is done by minimizing the sum
of squares of distances between the data and the
corresponding cluster centroids.



How does the K-Means Clustering algorithm work?


























K-MEANS CLUSTERING STEPS 


Step 1: Begin with a decision on the value of k =
number of clusters.

Step 2: Choose any initial partition that classifies the
data into k clusters. You may assign the training
samples randomly, or systematically as
follows:

1. Take the first k training samples as single-element
clusters.

2. Assign each of the remaining (N-k) training
samples to the cluster with the nearest
centroid. After each assignment, recompute the
centroid of the gaining cluster.





Step 3: Take each sample in sequence and compute its 
distance from the centroid of each of the clusters. If a 
sample is not currently in the cluster with the closest 
centroid, switch this sample to that cluster and update the 
centroid of the cluster gaining the new sample and the 
cluster losing the sample. 


Step 4: Repeat step 3 until convergence is achieved, that is,
until a pass through the training samples causes no new
assignments.
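
The four steps above can be sketched as follows. This is a minimal Python illustration rather than a reference implementation: it uses the systematic initialization from Step 2, the Euclidean distance via `math.dist` (Python 3.8 or later), and hypothetical names (`k_means_sequential`, `nearest`) and sample data chosen for the example.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def nearest(point, centroids):
    """Index of the closest centroid; ties go to the lowest index."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def k_means_sequential(samples, k):
    # Step 2: the first k samples seed k single-element clusters ...
    members = [[i] for i in range(k)]
    centroids = [tuple(samples[i]) for i in range(k)]
    labels = list(range(k))
    # ... and each remaining sample joins the nearest cluster, whose
    # centroid is recomputed right after the assignment.
    for i in range(k, len(samples)):
        j = nearest(samples[i], centroids)
        labels.append(j)
        members[j].append(i)
        centroids[j] = centroid([samples[m] for m in members[j]])

    # Steps 3 and 4: sweep over the samples until a full pass changes nothing.
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(samples):
            old, new = labels[i], nearest(p, centroids)
            # Switch only for a strictly closer centroid. A sample that is
            # alone in its cluster sits at distance 0 and therefore never
            # moves, so no cluster can become empty.
            if math.dist(p, centroids[new]) < math.dist(p, centroids[old]):
                members[old].remove(i)
                members[new].append(i)
                labels[i] = new
                centroids[old] = centroid([samples[m] for m in members[old]])
                centroids[new] = centroid([samples[m] for m in members[new]])
                changed = True
    return labels, centroids


# Hypothetical data: two well-separated groups of three points.
samples = [(0.0, 0.0), (10.0, 10.0), (1.0, 0.0),
           (0.0, 1.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = k_means_sequential(samples, k=2)
print(labels)  # [0, 1, 0, 0, 1, 1]
```

Because centroids are updated after every single switch, the result can depend on the order of the samples, which is one of the weaknesses usually attributed to this procedure.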





A simple example showing the implementation of
the k-means algorithm
(using K=2)

Step 1:

Initialization: we randomly choose the following two centroids (k=2)
for the two clusters.

In this case the 2 centroids are: m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Step 2:

• Thus, we obtain two clusters
containing:

{1,2,3} and {4,5,6,7}.

• Their new centroids are:

$m_1 = \left(\tfrac{1}{3}(1.0 + 1.5 + 3.0),\ \tfrac{1}{3}(1.0 + 2.0 + 4.0)\right) = (1.83, 2.33)$

$m_2 = \left(\tfrac{1}{4}(5.0 + 3.5 + 4.5 + 3.5),\ \tfrac{1}{4}(7.0 + 5.0 + 5.0 + 4.5)\right) = (4.12, 5.38)$

Step 3:

• Now, using these centroids,
we compute the Euclidean
distance of each object, as
shown in the table.


• Therefore, the new clusters
are:

{1,2} and {3,4,5,6,7}

• The next centroids are:
m1 = (1.25, 1.5) and m2 =
(3.9, 5.1)

Step 4:

The clusters obtained are:
{1,2} and {3,4,5,6,7}

Therefore, there is no change
in the clusters.

Thus, the algorithm comes to
a halt here, and the final result
consists of 2 clusters: {1,2} and
{3,4,5,6,7}.
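
The arithmetic of this example can be checked with a short Python sketch. Two assumptions are made: the data table is not reproduced here, so the seven coordinates are inferred from the sums that appear in the Step 2 centroid formulas, and the function name `refine` is hypothetical. The sketch starts from the Step 2 partition {1,2,3} and {4,5,6,7} and alternates centroid updates with reassignment until nothing changes.

```python
import math

# Coordinates of individuals 1..7, inferred from the centroid sums in
# Step 2 (an assumption, since the original data table is not available).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]


def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))


def refine(points, labels, k):
    """Recompute centroids and reassign every point to its nearest
    centroid until the labels stop changing. Assumes no cluster ever
    becomes empty, which holds for this data."""
    while True:
        centroids = [centroid([p for p, l in zip(points, labels) if l == j])
                     for j in range(k)]
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            return labels, centroids
        labels = new_labels


# Start from the Step 2 partition: {1,2,3} -> cluster 0, {4,5,6,7} -> cluster 1.
labels, centroids = refine(points, [0, 0, 0, 1, 1, 1, 1], k=2)
clusters = [[i + 1 for i, l in enumerate(labels) if l == j] for j in range(2)]
print(clusters)   # [[1, 2], [3, 4, 5, 6, 7]]
print(centroids)  # [(1.25, 1.5), (3.9, 5.1)]
```

Individual 3 is the only one that changes cluster: it lies about 2.04 from the first Step 2 centroid but only about 1.78 from the second.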

PLOT

(with K=3)

Weaknesses of K-Means Clustering

1. When the number of data points is small, the initial grouping
strongly influences the resulting clusters.

2. The number of clusters, K, must be determined beforehand. A further
disadvantage is that the algorithm does not yield the same result with
each run, since the resulting clusters depend on the initial
random assignments.

3. We never know the real clusters: with the same data, a
different input order may produce different
clusters if the number of data points is small.

4. It is sensitive to the initial conditions. Different initial conditions
may produce different clusterings. The algorithm may be
trapped in a local optimum.
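
The sensitivity to initial conditions is easy to reproduce on a tiny data set. The sketch below is an illustration only: it uses the batch form of the algorithm (reassign all points, then update all centroids), four made-up points at the corners of a 4-by-1 rectangle, and a hypothetical function name `lloyd`. The two starting positions converge to different partitions, and the second is a local optimum with a much larger sum of squares.

```python
import math


def lloyd(points, centroids):
    """Batch k-means from the given starting centroids.
    Returns the final labels and the sum-of-squares cost."""
    k = len(centroids)
    labels = None
    while True:
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # a full pass changed nothing: converged
        labels = new_labels
        updated = []
        for j in range(k):
            group = [p for p, l in zip(points, labels) if l == j]
            if group:
                n = len(group)
                updated.append(tuple(sum(c) / n for c in zip(*group)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        centroids = updated
    cost = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return labels, cost


# Hypothetical data: corners of a rectangle 4 units wide and 1 unit high.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

print(lloyd(points, [(0.0, 0.5), (4.0, 0.5)]))  # ([0, 0, 1, 1], 1.0)
print(lloyd(points, [(2.0, 0.0), (2.0, 1.0)]))  # ([0, 1, 0, 1], 16.0)
```

A common remedy is to run the algorithm from several starting positions and keep the result with the lowest cost.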



Applications of K-Means Clustering

It is relatively efficient and fast. It computes the result
in O(tkn) time, where n is the number of objects or points, k
is the number of clusters and t is the number of iterations.

K-means clustering can be applied to machine
learning or data mining.

Used on acoustic data in speech understanding to
convert waveforms into one of k categories (known
as vector quantization); the same idea is applied to image segmentation.

Also used for choosing color palettes on old-fashioned
graphical display devices and for image
quantization.



CONCLUSION 


The k-means algorithm is useful for undirected
knowledge discovery and is relatively simple.
K-means has found widespread usage in many
fields, including unsupervised learning of
neural networks, pattern recognition,
classification analysis, artificial intelligence,
image processing, machine vision, and many
others.


